Resource managers often provide for a periodic "node health check" to be
performed on each compute node to verify that the node is working
properly. Nodes which are determined to be "unhealthy" can be marked as down or
offline so as to prevent jobs from being scheduled or run on them. This helps
increase the reliability and throughput of a cluster by reducing preventable
job failures due to misconfiguration, hardware failure, etc. OpenHPC
distributes \nhc{} to fulfill this requirement.

In a typical scenario, the \nhc{} driver script is run periodically on each compute
node by the resource manager client daemon. It loads its
configuration file to determine which checks are to be run on the current node
(based on its hostname). Each matching check is run, and if a failure is
encountered, \nhc{} will exit with an error message describing the problem. It can
also be configured to mark nodes offline so that the scheduler will not assign
jobs to bad nodes, reducing the risk of system-induced job failures. 

% begin_ohpc_run
% ohpc_validation_newline
% ohpc_validation_comment Optionally, enable nhc and configure
\begin{lstlisting}[language=bash,keywords={},upquote=true]
# Install NHC on master and compute nodes
[sms](*\#*) (*\install*) nhc-ohpc
[sms](*\#*) (*\chrootinstall*) nhc-ohpc
\end{lstlisting}
% end_ohpc_run

